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We show that under the matrix product state formalism the states produced in Shor’s algorithm 
can be represented using 0(max(4Zr^, 2^*)) space, where I is the number of bits in the number to 
factorise, and r is the order and the solution to the related order-finding problem. The reduction in 
space compared to an amplitude formalism approach is significant, allowing simulations as large as 
42 qubits to be run on a single processor with 32GB RAM. This approach is readily adapted to a 
distributed memory environment, and we have simulated a 45 qubit case using 8 cores with 16GB 
RAM in approximately one hour. 
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I. INTRODUCTION 

Running large-scale quantum algorithms on a quan¬ 
tum computer is still a distant goal. The qubit [l| is a 
two-level system which can be in superpositions of both 
states, and is the fundamental unit of information for 
quantum computation. Not only is creating large num¬ 
bers of qubits with precise control difficult, so too is pre¬ 
serving the delicate entanglement. Thus, classical sim¬ 
ulation is essential to study the tolerance of quantum 
algorithms against qubit decoherence and control errors. 

Shor’s algorithm for prime number factorisation 
is a well known quantum algorithm, and has been simu¬ 
lated extensively in the amplitude formalism M. The 
computational resources grow exponentially with system 
size and the simulation results are easily verified, making 
it an ideal test candidate. Amplitude formalism simula¬ 
tors have simulated systems with as many as 42 qubits on 
massive supercomputers 0,i, using highly parallelised 
software. These calculations represent the current limit 
using the amplitude formalism. 

Here we consider a different approach to classical quan¬ 
tum algorithm simulation. The matrix product state 
(MPS) formalism Q is a specialised form of tensor net¬ 
work where the interacting bodies are confined to one 
dimension. Briefly, each quantum state is represented 
as a product of matrices. The matrix dimensions depend 
on the entanglement in the system, and exceptional cases 
may lead to an efficient simulation. Fortunately for quan¬ 
tum computing, this is not the case for Shor’s algorithm. 
Nonetheless, MPS can be useful for circuit simulation. 

In this paper, we shall show that the space require¬ 
ments for running Shor’s algorithm using a MPS simu¬ 
lator is 0(max(4Zr^, 2^^)), where I is the number of bits 
in the number N to factorise, and r is the order of the 
related order-finding problem. The space reduction com¬ 
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pared to the 0(2^^) of amplitude formalism simulators 
allows us to simulate a 42 qubit case on a single proces¬ 
sor with 32GB RAM in a matter of hours. We discuss 
optimisation techniques that arise from circuit. We also 
reason that MPS simulators are well suited to parallelisa¬ 
tion, and demonstrate near ideal scaling using MPI. Our 
largest simulation instance consisted of 45 qubits, using 
8 processor cores with 16GB RAM in approximately one 
hour (real time). 

This paper is organised as follows. Section [H] is a re¬ 
view of Shor’s algorithm. Section Hill is a review of MPS. 
Section |TV] analyses the space performance of Shor’s al¬ 
gorithm when stored as a MPS. Section |V] describes our 
implementation in detail, and tabulates single process 
benchmarks. Section I VII discusses a multi-processor im¬ 
plementation, and benchmarks indicate good scaling be¬ 
haviour. 


II. REVIEW OF SHOR’S ALGORITHM 

We shall give a brief review of Shor’s algorithm ii- 
Given an Z-bit number N = p x q, where the integers p 
and q are co-prime, the task is to determine the factors 
p and q. This is canonically achieved by reduction to 
a related order-finding problem: for a chosen integer x, 
where 1 < a: < V, find the order r oi x modulo N. The 
order r is the smallest positive integer satisfying 

a;*' = 1 (mod N). (1) 

It can be shown that for even valued r and x^!'^ 7^ 
(mod V), one can find the factors of N by computing 
the greatest common denominator gcd(a;^'’'^±l, V). This 
can be performed efficiently by Euclid’s algorithm. The 
quantum order-finding circuit (Fig [Ij that lies at the 
heart of Shor’s algorithm allows a quantum computer 
to efficiently determine the order r, using qubits and cir¬ 
cuitry growing only polynomially with 1. It consists of 3/ 
qubits divided into two registers: the upper 21 qubits are 
initialised in |0), and the remaining I qubits are initialised 
in |1). In the figure, H = {ax + af)l'j2 is the Hadamard 
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by the size of the system, regardless of the state stored. 
The MPS formalism is a different way to store the state, 
where the space required grows as the state becomes en¬ 
tangled, by using the ability to write a general quantum 
state I'l/') as 

1^) = ■ ■ •) |a) \b) |c) • • • . (5) 

abc--- 


FIG. 1. Schematic of the order-finding circuit which efficiently 
determines the order r. To factorise an i-bit number N, one 
uses an upper register with 21 qubits, a lower register R with I 
qubits, and U \ a) = \ax mod N) for some integer 1 < a; < A'^. 
QFT is the quantum Fourier transform. 


operation, where and cr^ are the Pauli matrices, the 
unitary U performs 

U |a) = |aa; mod N ), (2) 

and QFT denotes the quantum Fourier transform [ 13 . 
Neglecting normalisation constants, one can show that 
the state created after the controlled-C/s have been ap¬ 
plied is 


22i_i 

IV') = ^ I*) mod N) . (3) 

i=0 

We shall always suppress the Kronecker or tensor prod¬ 
uct between two kets (i.e. |a) \b) = |a)(8)|6)). Substituting 
in Eqn [1] allows one to rewrite this as 


= ^ j ^ \jr -t- i) 1 \x'‘ mod iV) . 


( 4 ) 


2=0 \i=o 


After applying all of the controlled-C/s, one can mea¬ 
sure and discard the lower register, leaving a system con¬ 
taining only 2/ qubits. The QFT produces a probabil¬ 
ity distribution in the upper register with high probabil¬ 
ity densities in states around integer multiples of 2^*/r. 
The results of a few measurements from this probabil¬ 
ity distribution can be classically processed into r via 
the continued fractions algorithm. Since it is possible to 
construct circuits to efficiently perform all powers of U 
P P and the QFT P, it is therefore possible for a 
quantum computer to efficiently determine the order r 
and hence factorise large numbers. 


III. REVIEW OF MATRIX PRODUCT STATES 

Here we briefly review the MPS formalism Q. The 
reader may be familiar with amplitude formalism state 
vector simulators, where one stores the state as a col¬ 
lection of complex coefficient-index pairs. Typically this 
is implemented as a complex number along a contigu¬ 
ous array. This approach requires space governed only 


Here |a), |6), |c), etc. are the bases of the subsystems 
of the state, and Aa, Bh, Cc^ etc. are matrices associ¬ 
ated with the respective subsystems. That is, the ampli¬ 
tude Uabc- - of the state is stored as the matrix product 
AaBbCc ■ ■ ■ ■ The indices a, b, c, etc. span the dimensions 
of their respective subsystems (e.g. for a qubit, the index 
can be zero or one), allowing for straightforward general¬ 
isations to d-level subsystems (i.e. qudits). Indeed, any 
two or more neighbouring qudits in a MPS—that is, their 
matrices are adjacent in the product—can be combined 
into a single higher-dimensional qudit simply by the eval¬ 
uating the different matrix products and redefining the 
subsystems (e.g. Da = Dab = AaBb). 

A single-qudit gate Ui is applied by mapping Ui be¬ 
tween the corresponding elements in the matrices associ¬ 
ated with that qudit only. Many-qudit gates (assuming 
all involved qudits form a contiguous sequence) can be 
applied by first contracting those qudits together, thus 
forming a single qudit, and then applying the equivalent 
single-qudit gate on the combined system. Swap gates 
can be used to rearrange qudits amongst one another 
when interacting initially distant qudits. Finally, one can 
separate the composite system back into the original par¬ 
titions, or as desired, by applying a matrix decomposition 
algorithm on an specifically arranged matrix: 
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The left-hand side shows the arrangement of the matrices 
Dab, where 0 < a < i and 0 < 6 < j, of the composite 
qudit within a single m x n matrix, for decomposition 
into a left (i -I- l)-dimensional qudit and a right (j -|- 1)- 
dimensional qudit. A matrix decomposition algorithm 
returns the two right-hand side matrices (or something 
similar), which have dimensions m x k and k x n, where 
k is ideally the rank of the left-hand side matrix (the 
number of linearly independent rows or columns). After 
the decomposition, one can readily extract the matrices 
Aa and Bb- Rank-revealing decompositions such as the 
singular value decomposition (SVD) and the pivoted QR 
decomposition (RRQRF), although not strictly necessary, 
are beneficial as they minimise k and, hence, the required 
storage space. 

Note that, due to numerical imprecision, the rank is 
normally understood to be the number of eigenvalues 
above a given threshold returned by the SVD, and one 
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can introduce a form of approximation by raising the 
cut-off value. MPS was developed as an efficient method 
for simulating systems where the number of non-trivial 
eigenvalues is limited. However, for the controlled-C/ 
gates in the order-finding circuit, the eigenvalues are ei¬ 
ther clearly zero or non-zero, thus increasing the cut-off 
value does not provide any benefit (without invalidating 
the results). 
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IV. ORDER-FINDING CIRCUIT SPACE 
ANALYSIS 

Amplitude formalism state vector simulators are 
severely limited by the number of qubits in the circuit. 
In the case of the order-finding circuit, the most resource 
intensive state under the amplitude formalism is the state 
as one applies the final controlled-f7 operation (Eqn|4l), 
requiring 0(2^*) space for a standard implementation, 
and 0(2^^) space for one using sparse storage. 

We now show that the state in Eqn 0] can be stored 
much more economically by using the MPS formalism. 
We wish to first divide this state into two qudits: a left- 
hand side ket which is the contraction of the upper 2/ con¬ 
trol qubits, and a right-hand side ket for lower I qubits. 
A general state with this division can be written as 


TABLE I. The ranks across the MPS after completing 
the order-finding circuit up to and including the indicated 
controlled-unitary for N = 65, x = 2, I = 7. Increasing time 
flows down the table. Adjacent pairs of numbers define the 
dimensions of the intermediate matrix. The lower register is 
stored at the rightmost end, so that the second last number 
in each row is also equal to the number of non-trivial states in 
the lower register. The underlined value is the order r = 12. 
The final order of qubits is qo,qi,- " > </i 3 ■ 
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2^‘-l 2‘-l 

i^)= E (7) 

a —0 6—0 

Here Aa and are complex row and column vectors 
respectively, so that the product AaB^ is a scalar. We 
note that in Eqn |4l the summation index i spans r val¬ 
ues only, corresponding to the r unique values generated 
by X* mod N. Erom this comparison, one concludes that 
there are only r non-trivial Bb matrices in Eqn[71 Eor lin¬ 
ear independence, the Bb matrices must contain at least 
r rows. The Aa matrices must possess as many columns 
as Bb does rows to satisfy the matrix product. It follows 
that for this particular decomposition, the state can be 
expressed with 0(2^* [1 x r] -|- r [r x 1]) space. 

One can further decompose the left-hand side ket down 
to individual control qubits. Each of the 21 control 
qubits’ two bases can be represented by matrices no 
greater than r x r. This is easily seen from the structure 
of the order-finding circuit; the control qubits only in¬ 
teract with the lower register, and not with one another. 
Using this decomposition, Eqn0]can be represented using 
0{2l • 2 [r X r]) space. 

To illustrate the above analysis, see Table HI Here 
we have graphed the ranks across the MPS after each 
controlled-!/ gate is applied for the case N = 65, x = 2, 
and r = 12. We shall describe our implementation in 
detail in Section |Vl Eor now, it suffices to know that ad¬ 
jacent pairs of numbers define the dimensions of the in¬ 
termediate matrix. One can see that no matrix exceeds 
dimensions r x r. 


TABLE 11. Applying controlled-unitaries in decreasing powers 
of two lead to equal or lower ranks across the MPS (cf. Table 
HI), thus improves both space and time performance. The final 
order of qubits is gia, (J 12 , • • • ,qo- 


The remainder of the circuit can be completed by first 
measuring and discarding the lower register, leaving be¬ 
hind a 21 qubit system. All of the control qubits can 
then be contracted to a single qudit. At this point, the 
MPS representation becomes equivalent to the amplitude 
formalism representation, thus it requires 0(2^’") space. 
Therefore, the entire order-finding circuit can be com¬ 
pleted using 0(max(4/r^, 2^*)) space. Further space re¬ 
duction is possible by using sparse matrices—after read¬ 
ing Section El it should become obvious that the space 
required before the QFT is 0(lr). 

Note that the of value r will arise naturally as the 
simulation is run. It is interesting to observe that in r < 
log 2 N cases, one can efficiently simulate Shor’s algorithm 
by using the semi-classical QFT [H, Hi. However, it is 
exceedingly rare to choose an x with r < log 2 N [H. 
Furthermore, for the case x = A — 1, it is easy to check 
that r = 2, but the algorithm does not return any non¬ 
trivial factors. 


V. IMPLEMENTATION AND BENCHMARKS 

We now describe the details of our simulations and 
present our benchmarks. Preliminary processing steps 
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required before the order-finding circuit available else¬ 
where (e.g. see Q) and will not be repeated here. We 
shall divide the circuit into three distinct steps: (1) the 
application of the controlled-C/ gates; (2) the measure¬ 
ment of the lower register; and, (3) the final quantum 
Fourier transform. 

The circuit consists of 21 control qubits, labelled qi, 
0 < i < 21, such that the qubit qi controls the gate 
. The remaining I qubits form the lower register R. 
They are always contracted into a single 2bdimensional 
qudit (technically, the circuit also works with a single 
iV-dimensional qudit). Following our space analysis, we 
choose only to store the non-trivial matrices of R (at most 
r), and furthermore, we choose to always keep it at the 
rightmost end of the product, so that these matrices are 
always column vectors, each containing at most r rows. 

The state is initialised in |0) |1). The H gates may 
be applied immediately, or combined into the following 
controlled-17 gates. We have opted to implement the 
latter. In both cases, all of the matrices representing 
the state are, at present, simply scalars—either 0, 1, 
or 1/ 1 / 2 —and thus the internal ordering amongst qubits 
and qudits is negligible. Note, however, that for a general 
entangled state, the ordering is of critical importance. 


A. The Controlled-!/ Gates 

In order to apply the controlled-!/^* gate, we first shift 
the control qubit qi directly to the left of lower register R. 
Since qi is in a product state, or more correctly, its matri¬ 
ces are currently scalars, one can commute it through the 
matrix product with ease, and resize it (i.e. multiplying 
it by an appropriately sized square identity matrix) to 
retain a well-formed MPS. 

Now one can contract qi with R, and subsequently ap¬ 
ply the controlled operation. Let this combined system 
have matrices denoted by Dab, where a € {0, 1 } and b are 
the non-trivial states of R. Then, in order to separate the 
system into two partitions, with qi on the left and R on 
the right, one arranges the matrices Dab into a single ma¬ 
trix as shown on the left-hand side of Eqn|6l In practice, 
instead of performing each of these steps independently, 
one can perform the contraction, apply the controlled-!/ 
gate, and arrange the Dab matrices in a single swift blow 
for efficiency. Additionally, the H operation on qi, which 
simply takes the state from |0) to (|0) -I- |1)) /\/2, is also 
readily combined into the above steps. 

A matrix decomposition algorithm takes the matrix 
on the left-hand side of EqnlHland produces the two ma¬ 
trices on the right-hand side. These contain the new 
matrices associated with the states of qi and R. Note 
that the decomposition is not unique. As mentioned ear¬ 
lier, the intermediate dimension is minimised by using 
a rank-revealing decomposition, and is at best the rank 
k of the left-hand side matrix. However, in this case, 
we know k is exactly the number of non-trivial states of 
R. Furthermore, we have arranged the matrix such that 


there are k columns (we can think of this as pivoting the 
trivial columns to the end of the matrix). For the sake 
of speed yet still maintaining the space performance of 
MPS, whenever one suspects an m x n matrix A having 
rank k ~ imii{m,n), as is the case here, one can use the 
trivially simple decomposition: 

, \ Ajjixn X n j ^ ^ ^ / q \ 

Amxn ~ S r A ■ 1°! 

(.!mxm A,jixn: TTl ^ n 

The above discusses how a single controlled-!/ gate is 
implemented. The circuit consists of 21 such gates, which 
differ only in their exponents and, therefore, commute. 
We find that it is never worse and often better to apply 
the I/s in decreasing powers of two. This is because the 
sequence generated by !/^* = x^' mod N, i > 0, will 
eventually lead to a cyclic subsequence. Applying the 
I/s in increasing powers saturates the rank in the fewest 
number of steps, which occurs just before the cycle re¬ 
peats. In contrast, applying the I/s in decreasing powers 
leads to performing the same operations on R at the be¬ 
ginning. Repeated operations will tend to populate the 
same states in R as previous applications, thus leads to a 
slower increase in ranks. Other orderings are also possi¬ 
ble, but we have not considered them due to foresight of 
the forthcoming QFT. The rank between the rightmost 
control qubit and R reaches r once all unique U s have 
been applied. 

Table U shows the ranks after each controlled-I/ oper¬ 
ation is applied, for the example N = 64, x = 2, and 
r = 12. Time flows down the table, reflecting applying 
the t/s in increasing powers of two. Our implementation 
places the most recently interacted control qubit furthest 
to the right, which leads to ordered qubits upon complet¬ 
ing the controlled-1/ gate sequence: qo,qi,--- , 913 - Using 
this ordering, the state can be stored using 3300 com¬ 
plex numbers, or approximately 50KB. Table [H] shows 
the same simulation parameters, now with I/s applied 
in decreasing powers, and hence reverse ordered qubits. 
Under this ordering, the same state can be stored using 
only 520 complex numbers, or approximately 8KB. 


B. Measurement of Lower Register 


Measurement of the lower register requires the proba¬ 
bility of each state i in R, which can be calculated from 
first principles using EqnO One can show that this is 


pr(i) = \aab---i\'^ = R} 





It is possible to enforce J2a = 1 for all qubits and 
qudits in the MPS by using the SVD exclusively. This 
allows one to calculate the probability of a given state of 
any qubit or qudit by using its own matrices only. How¬ 
ever, this is not true in our simulations; we have sacrificed 
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measurement speed for simpler decompositions, namely 
Eqn[51 

Having computed the probabilities, one can apply a 
projection and rescale the remaining amplitudes. Since 
the projection will reduce the entanglement in the entire 
system, one should take this opportunity to reduce the 
space consumption. We apply the RRQRF pairwise from 
right to left for this purpose. We have chosen the RRQRF 
only for its rank-revealing property and its speed over 
the SVD, and not for any structure in the results. We 
have also avoided the faster LU decomposition as it is 
considered less numerically stable. 



FIG. 2. Standard circuit implementing the quantnm Fourier 
transform on fonr qubits. Controlled-phase gates performing 
|11) —>■ exp (i7r/2") |11) are denoted by the number n in a 
square. Swap gates are denoted by two crosses joined together 
by a vertical line. Long-range gates make this circuit less than 
ideal on a MPS simulator. 


C. The Quantum Fourier Transform 

The typical circuit for the QFT is shown in Fig [2] The 
use of long-range controlled-phase gates make this circuit 
less than ideal when using a MPS simulator. Note that 
the final swap gates can be implemented classically and 
thus do not pose a problem. 

To overcome this, one can be rearrange the QFT into 
21 subcircuits, each of which includes exactly one more 
qubit than its predecessor, as shown in Fig [3] One con¬ 
tracts a single qubit into a growing qudit at the be¬ 
ginning of each subcircuit, and applies all gates in the 
subcircuit directly onto this qudit. This arrangement is 
preferable because one eliminates the need to perform 
any matrix decompositions. Upon completion, a single 
2^hdiinensional qudit remains (i.e. a 21 qubit amplitude 
formalism state vector). 

Another approach is to add swap gates after each 
controlled-phase gate, akin to implementing the QFT on 
a Id quantum computer with only nearest neighbour in¬ 
teractions, as shown in Fig |4l This approach is initially 
slower, however, it may lead to better space performance 
when coupled to the SVD due to truncation of insignifi¬ 
cant eigenvalues. 


D. Benchmarks 



FIG. 3. QFT rearranged to be more compatible with MPS. 
Dashed boxes show the repetitive pattern of the subcircuits. 
No matrix decompositions are required, and all qubits in¬ 
volved are contracted into a single qudit at the completion of 
the circuit. 



FIG. 4. QFT for a Id nearest-neighbour quantum computer 
is also suitable for MPS. Matrix decompositions are only re¬ 
quired after every swap operation. 


Table nni shows the CPU times in seconds to simu¬ 
late Shor’s algorithm for various N and r for each of 
the described stages in the algorithm: tu to apply all 
controlled-U gates, fmeas to measure the lower register 
and minimise the ranks, and Iqft to apply the QFT (us¬ 
ing Fig El). As the memory and time depends heavily 
on r, X has been chosen ahead of time to maximise r 
for the given N, and we have also chosen cases where 
r Ki N/2. All simulations use a single core and double 
precision complex numbers; the I = 12 case would require 
1TB RAM on an amplitude formalism state vector sim¬ 
ulator. The I < 12 cases take minutes to run on a laptop 
(Intel Core 2 Duo P8400 @ 2.26GHz, 2GB RAM), and 
the I > 13 cases take hours on a cluster (AMD Opteron 
@ 2.5GHz, 32GB RAM). For comparison, I = 12 cases 
take 970s (IBM Regatta p690-l-, 512 processes) and 170s 


(IBM BlueGene/L, 4096 processes) on very highly paral¬ 
lelised amplitude formalism simulators Q , corresponding 
to days in CPU time. The same software has been used 
to simulate I = 14 cases, or equivalently 42 qubits (64TB 

RAM) a. 


VI. MULTI-PROCESSOR IMPLEMENTATION 

Simulations of Shor’s algorithm benefit immensely 
from storing the state as a MPS. However, single process 
simulations will ultimately be memory limited, which is 
typically resolved by distributed computing. We show 
that MPS is well suited to this regime. We use the 
Message Passing Interface (MPI) to manage inter- 
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1 

11 

12 

13 

14 

N 

2033 

4063 

8189 

16351 


(19 x 107) 

(17 x 239) 

(19 X 431) 

(83 X 197) 

X 

2 

3 

10 

2 

r 

954 

1904 

3870 

8036 

tu 

0.2 

0.1 

2.6 

3.8 

tmesbs 

49 

76 

4843 

12, 799 

tQFT 

11 

6.7 

962 

1949 

^total 

60 

83 

5808 

14, 752 


TABLE III. CPU time in seconds the distinct steps of the 
order-finding circuit (Fig[T|): tu to complete all controlled- 
f/s; tmeas to measure out the lower register; and, Iqft perform 
the QFT. ftotai is the total time, x is chosen to maximise the 
order r for the given N. r determines the maximum size of 
the matrices before the QFT. 

process communication, the PBLAS library [13 for basic 
distributed matrix operations, and ScaLAPACK [I^ for 
linear algebra. 

We assume that our processes are arranged into a 2d 
grid. For simplicity, we shall also assume that the num¬ 
ber of processes riproc is a power of two. The matrices 
associated with the qudits are divided and distributed to 
these processes. For our simulations, we have chosen to 
divide matrices into 16 x 16 blocks, which are distributed 
cyclically. It is straightforward to port the program to 
this environment; many-qudit operations under MPS are 
simply some combination of matrix multiplications, re¬ 
arranging or adding sub-matrices, and matrix decompo¬ 
sition. 

The simple data distribution described above works 
well for the majority of the simulation as the matrices 
are large and relatively square (Table |T|. However, one 
must take care during the QFT (Fig|3|). This is because 
the circuit combines many qubits to form a very high 
dimensional qudit, with the most extreme case appear¬ 
ing upon completion, a single 2^h(jiniensional qudit. One 
can easily produce unevenly distributed data, for exam¬ 
ple, by creating either 2^^ 1 x 1 matrices (all data stored 
in a single process), or a single 2^hcJimensional row or 
column vector (data distributed along a single row or 
column only). To ensure a balanced data distribution 
throughout the circuit, we set the matrix’s origin (i.e. 
the process containing the first element) to be the pro¬ 
cess whose rank is equal to the first log 2 (nproc) bits of 
the state. This is analogous to the distribution mechanic 
of large amplitude formalism simulators, where the most 
significant log 2 (nproc) bits of the state determine the pro¬ 
cess in which it is stored. 

We have further divided the matrices into two aligned 
matrices, for the cases where the most significant qubit is 
|0) and |1). Since the most significant qubit is the qubit 
to which the H gate is applied in each of the subcircuits 
in Fig [31 this allows H to be applied locally. Note that. 


1 

N 

^proc 

tu 

^meas 

tQFT 

^total 

13 

8189 

1 

3.4 

6763 

1539 

8305 



2 

2.1 

3378 

511 

3891 



4 

1.3 

1636 

253 

1890 



8 

0.7 

886 

128 

1015 



16 

0.5 

452 

76 

529 

14 

16,351 

1 

6.3 

13,173 

3909 

17,088 



2 

3.6 

5588 

1430 

7022 



4 

1.8 

2769 

788 

3558 



8 

1.2 

1471 

356 

1829 



16 

0.7 

741 

172 

914 

15 

32,663 

4 

5.2 

7510 

1934 

9449 



8 

3.1 

3239 

788 

4031 


TABLE IV. Wall-clock time in seconds for the distinct steps of 
the order-finding circuit for three cases: I — N = 8189 = 
19 X 431, X = 10, r = 3870; Z = 14, A = 16351 = 83 x 197, 
X = 2, r = 8036; and, I — 15, N = 32663 = 89 x 367, x = 6, 
r = 16104. Wproc is the number of MPI processes or CPU 
cores. 


as is the case for the amplitude formalism, the controlled- 
phase gates can also be applied without additional data 
transfer. 

Table HV] shows the wall-clock times in seconds for 
our implementation. These were run on the same clus¬ 
ter mentioned in the single process simulations (AMD 
Opteron @ 2.5GHz). The largest case simulated was 
I = 15, using approximately 16GB RAM spread over 8 
processes and taking approximately one hour. Because 
these simulations were run on a shared cluster, the execu¬ 
tion time can deviate by 25% or more, depending on the 
load on the nodes and the assignment of nodes. Thus the 
listed times are the lowest from each case, obtained from 
a small number of runs. The results indicate good scaling 
behaviour, doubling the number of processes leads to a 
40-60% decrease in the total execution time. 

We also mention here that there is another difference 
between these simulations and those of Table imi Re¬ 
call that after the measurement projection, we apply a 
rank-revealing decomposition across the MPS to reduce 
the matrix sizes. Unfortunately, the QR decomposition 
from ScaLAPAGK 1.8.0 (pzgeqpf) appears to not always 
choose pivots such that the diagonal (leading) elements of 
R are in order of descending magnitude, as described in 
[l8l |. This appears to be a bug, and the behaviour means 
we are unable to determine the numerical rank through 
this routine. Thus in these benchmarks, we have instead 
opted to use the SVD (pzgesvd). 
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VII. CONCLUSION 

We have shown that, in order to factorise an ^-bit num¬ 
ber N with order r, a MPS approach reduces the space 
complexity from 0(2^^) to 0(max(4Zr^, 2^*)) in the case 
of dense storage. This is a significant improvement as r is 
typically much smaller than N, allowing one to simulate 
circuits with as many as 36 qubits on a standard laptops. 
Our largest simulation on a single processor of 42 qubits 
required less than 32GB RAM. 

We have also ported our simulations to use MPI. We 
have simulated circuits with as many as 45 qubits over 
8 CPU cores. Since MPS is built around matrix prod¬ 
ucts, the distributed matrix approach to parallelisation 
understandably works well. Our benchmarks show near 


l/'^proc scaling. 

One can extend these simulations, for example, by im¬ 
plementing the circuits for the controlled-unitaries, and 
by incorporating decoherence. Other quantum circuits 
with the same structure as the order-finding circuit are 
likely to benefit from an MPS approach, and many of 
the optimisations described here would be applicable to 
these and other MPS simulations. 
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